Machine learning training architecture for programmable devices

ABSTRACT

A programmable device may be configured to support machine learning training operations using matrix multiplication circuitry. In some embodiments, the multiplication is implemented on a systolic array. The systolic array includes an array of processing elements, each of which includes hybrid floating-point dot-product circuitry.

This application is a continuation of U.S. patent application Ser. No. 16/585,857, filed Sep. 27, 2019, which claims the benefit of provisional patent application No. 62/824,797, filed Mar. 27, 2019, each of which is hereby incorporated by reference herein in its entirety.

BACKGROUND

This invention relates generally to integrated circuits and, in particular, to programmable integrated circuits configured to support machine learning.

Programmable integrated circuits such as programmable logic devices (PLDs) include configurable logic circuitry having look-up tables (LUTs) and adder based logic that are designed to allow a user to customize the circuitry to the user's particular needs. In addition to this configurable logic, PLDs also include programmable interconnect or routing circuitry that is used to connect the inputs and outputs of the configurable logic blocks. The combination of this programmable logic and routing circuitry is referred to as “soft” logic.

Besides soft logic, PLDs may also include specialized processing blocks that implement specific predefined logic functions and thus cannot be configured by the user. Such specialized processing blocks may include a concentration of circuitry on a PLD that has been partly or fully hardwired to perform one or more specific tasks, such as a logical or a mathematical operation. One particularly useful type of specialized processing block that has been provided on PLDs is a digital signal processing (DSP) block. A conventional DSP block includes two 18-by-18 multipliers, which can be combined with other internal circuitry to form a larger 27-by-27 multiplier. The 27-by-27 multiplier is used as part of an IEEE 754 single precision floating-point multiplier, which requires 24 bits of precision.

Recent developments in artificial intelligence, such as advancements in machine learning and deep learning, involve training and inference, which have necessitated a much higher density of multiplications. In contrast to inference, which uses relatively simpler math and dataflow, machine learning training involves more complex large matrix multiplications that require access to external memory. Access to external memory is, however, limited by external memory bandwidth and internal bandwidth management constraints. Using traditional floating-point multipliers to support complex training operations on PLDs may be insufficient. Using too much soft logic in conjunction with the traditional floating-point multipliers to support training also tends to create fitting and timing closure problems.

It is within this context that the embodiments described herein arise.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an illustrative programmable integrated circuit in accordance with an embodiment.

FIG. 2 is a diagram of illustrative machine learning training circuitry in accordance with an embodiment.

FIG. 3 is a diagram of a systolic array processing element in accordance with an embodiment.

FIG. 4 is a diagram showing an illustrative matrix allocation to an array of processing elements in accordance with an embodiment.

FIG. 5A is a diagram of illustrative hybrid floating-point 16-element dot-product circuitry in accordance with an embodiment.

FIG. 5B is a diagram of an illustrative 2-element dot-product circuit in accordance with an embodiment.

FIG. 6A is a diagram of a classical floating-point multiplier.

FIG. 6B is a diagram of an illustrative customized floating-point multiplier within the 2-element dot-product circuit shown in FIG. 5B in accordance with an embodiment.

FIG. 7A is a diagram of a classical floating-point adder.

FIG. 7B is a diagram of an illustrative floating-point adder within the 2-element dot-product circuit shown in FIG. 5B in accordance with an embodiment.

FIG. 7C is a diagram of an illustrative customized floating-point adder in a first adder stage of the hybrid floating-point dot-product circuitry of FIG. 5A in accordance with an embodiment.

FIG. 7D is a diagram of an illustrative customized floating-point adder in a second adder stage of the hybrid floating-point dot-product circuitry of FIG. 5A in accordance with an embodiment.

FIG. 7E is a diagram of an illustrative customized floating-point adder in a third adder stage of the hybrid floating-point dot-product circuitry of FIG. 5A in accordance with an embodiment.

FIG. 8 is a diagram of an illustrative floating-point format conversion circuit within the hybrid floating-point dot-product circuitry shown in FIG. 5A in accordance with an embodiment.

FIG. 9 is a diagram of an illustrative normalization circuit within the hybrid floating-point dot-product circuitry shown in FIG. 5A in accordance with an embodiment.

FIG. 10A is a diagram of an illustrative barrel shifter.

FIGS. 10B and 10C are diagrams of illustrative carry-chain based barrel shifting circuits in accordance with some embodiments.

DETAILED DESCRIPTION

The present embodiments relate to a programmable integrated circuit and, in particular, to circuitry on a programmable integrated circuit for efficiently supporting machine learning training. It will be recognized by one skilled in the art that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well-known operations have not been described in detail in order not to unnecessarily obscure the present embodiments.

An illustrative embodiment of an integrated circuit such as programmable logic device (PLD) 100 is shown in FIG. 1. As shown in FIG. 1, programmable logic device 100 may include a two-dimensional array of functional blocks, including logic array blocks (LABs) 110 and other functional blocks, such as random access memory (RAM) blocks 130 and specialized processing blocks (SPBs) 120 that are partly or fully hardwired to perform one or more specific tasks such as mathematical/arithmetic operations. Functional blocks such as LABs 110 may include smaller programmable regions (e.g., logic elements, configurable logic blocks, or adaptive logic modules) that receive input signals and perform custom functions on the input signals to produce output signals. Device 100 may further include programmable routing fabric that is used to interconnect LABs 110 with RAM blocks 130 and specialized processing blocks 120 (sometimes referred to as digital signal processing or DSP blocks). The combination of the programmable logic and routing fabric is sometimes referred to as “soft” logic, whereas the DSP blocks are sometimes referred to as “hard” logic (i.e., circuit blocks that can operate independently from and do not rely on soft logic). In general, device 100 may also include other types of hard logic circuitry.

Programmable logic device 100 (e.g., a field-programmable gate array or “FPGA”) may contain programmable memory elements for configuring the soft logic. Memory elements may be loaded with configuration data (also called programming data) using input/output elements (IOEs) 102. Once loaded, the memory elements provide corresponding static control signals that control the operation of one or more LABs 110, programmable routing fabric, and optionally SPBs 120 or RAMs 130. In a typical scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors (e.g., pass transistors) to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths. Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc.

The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only-memory memory cells, mask-programmed and laser-programmed structures, mechanical memory devices (e.g., including localized mechanical resonators), mechanically operated RAM (MORAM), programmable metallization cells (PMCs), conductive-bridging RAM (CBRAM), combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration RAM (CRAM), configuration memory elements, or programmable memory elements.

In addition, programmable logic device 100 may have input/output elements (IOEs) 102 for driving signals off of device 100 and for receiving signals from other devices. Input/output elements 102 may include parallel input/output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit. As shown, input/output elements 102 may be located around the periphery of the chip. If desired, the programmable logic device may have input/output elements 102 arranged in different ways. For example, input/output elements 102 may form one or more columns of input/output elements that may be located anywhere on the programmable logic device (e.g., distributed evenly across the width of the PLD). If desired, input/output elements 102 may form one or more rows of input/output elements (e.g., distributed across the height of the PLD). Alternatively, input/output elements 102 may form islands of input/output elements that may be distributed over the surface of the PLD or clustered in selected areas.

The routing fabric (sometimes referred to as programmable interconnect circuitry) on PLD 100 may be provided in the form of vertical routing channels 140 (i.e., interconnects formed along a vertical axis of PLD 100) and horizontal routing channels 150 (i.e., interconnects formed along a horizontal axis of PLD 100), each routing channel including at least one track to route at least one wire. If desired, routing wires may be shorter than the entire length of the routing channel. A length L wire may span L functional blocks. For example, a length four wire may span four functional blocks. Length four wires in a horizontal routing channel may be referred to as “H4” wires, whereas length four wires in a vertical routing channel may be referred to as “V4” wires.

Furthermore, it should be understood that the present embodiments may be implemented in any integrated circuit. If desired, the functional blocks of such an integrated circuit may be arranged in more levels or layers in which multiple functional blocks are interconnected to form still larger blocks. Other device arrangements may use functional blocks that are not arranged in rows and columns.

Programmable device 100 may be used to support training neural networks. Training neural networks such as multilayer perceptrons (MLPs) is a compute-intensive process that involves repeated forward and backward operations, which include dense matrix multiplications. Due to the numerical properties of the training data, such as the requirement to support very small numbers, floating-point representations are often required. As a result, the overall training performance of an MLP neural network is limited by the overall floating-point throughput and also by the memory bandwidth of the underlying compute architecture.

Device configurations in which FPGA 100 is used to support an MLP neural network are sometimes described herein as an example. This is, however, merely illustrative. In general, the architecture and associated techniques described herein that improve the ability of device 100 to carry out MLP training may be applied to other types of training and machine learning processes.

Now delving into more detail, a multilayer perceptron is a neural network having several layers, each characterized by a weight matrix. Each layer uses a non-linear activation function (e.g., Rectified Linear Units or “ReLU”) and its inverse. The network propagates activation data, grouped into bursts, through all layers, and the learning process determines the required weight changes for each layer. Over several iterations of computation, the network learns weight matrices sensitive to some target function.

The training of such a network may involve two passes: (1) a forward pass, where each successive layer performs a matrix multiplication using the current weight matrix and the previous layer's output passed through the activation function; and (2) a backward pass that computes the gradient of the activation data and determines the changes that need to be applied to the weight matrix. The weight update policy may be based on stochastic gradient descent (SGD), as an example. In addition, a bias vector is learned alongside the weight matrix.
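The two passes can be illustrated with a minimal software sketch for a single layer. This is not the circuitry described herein; the function names, the squared-error loss, and the omission of the bias vector are assumptions made only to keep the example short. It shows that both the forward pass and the backward pass reduce to dense matrix multiplications followed by an SGD weight update.

```python
def matmul(a, b):
    """Dense matrix product of two row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def relu_grad(m):
    """Derivative of ReLU, evaluated element-wise on the pre-activation values."""
    return [[1.0 if v > 0 else 0.0 for v in row] for row in m]


def squared_error(y, target):
    # Assumed loss function: 0.5 * sum of squared differences.
    return 0.5 * sum((yv - tv) ** 2
                     for yr, tr in zip(y, target)
                     for yv, tv in zip(yr, tr))


def train_step(x, w, target, lr):
    """One forward and one backward pass for y = relu(x @ w); returns (new_w, loss)."""
    z = matmul(x, w)                      # forward pass: dense matrix multiplication
    y = relu(z)
    loss = squared_error(y, target)
    # Backward pass: gradient of the loss with respect to the pre-activations.
    dz = [[(yv - tv) * gv for yv, tv, gv in zip(yr, tr, gr)]
          for yr, tr, gr in zip(y, target, relu_grad(z))]
    dw = matmul(transpose(x), dz)         # weight gradient: another dense multiplication
    # SGD update: step against the gradient.
    new_w = [[wv - lr * gv for wv, gv in zip(wr, gr)] for wr, gr in zip(w, dw)]
    return new_w, loss


x = [[1.0, 2.0], [0.5, -1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
target = [[1.0, 0.0], [0.0, 1.0]]
w1, loss0 = train_step(x, w0, target, lr=0.1)
_, loss1 = train_step(x, w1, target, lr=0.1)
```

In this small example the loss reported by the second step is lower than the loss reported by the first, which is the expected effect of one gradient step with a small learning rate.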

The memory required to store all the weight and activation matrices for multiple layers likely exceeds the on-chip storage capacity available on an FPGA. All matrices may therefore be stored in a row-wise format in an external memory device such as an off-chip double data rate 4 (DDR4) dynamic random-access memory (DRAM) attached separately to an FPGA. External DDR memory is typically much more efficient at reading data sequentially (i.e., traversing one matrix direction will work well but not when traversing a different direction). The data should be reordered in such a way so that large sequences of consecutive addresses, which can be grouped into bursts, can be jointly retrieved. Having adequate DDR memory bandwidth can help sustain maximum performance.

FIG. 2 is a diagram of an illustrative machine learning training architecture 200 in accordance with an embodiment. As shown in FIG. 2, training circuitry 200 may include a 3-stage pipeline that reads and writes into off-chip memory 210 (e.g., DDR memory) connected via a ring topology. The first pipeline stage may include a first matrix loading circuit 202 (e.g., a circuit configured to load matrix “A”) and a second matrix loading circuit 204 (e.g., a circuit configured to load matrix “B”). The second pipeline stage may include matrix multiplication circuitry 206 implemented using a systolic array (as an example). The third pipeline stage may include a store circuit 208 configured to store a resulting matrix “C” into off-chip memory 210.

Load circuits 202 and 204 may incorporate optional transpositions and activation functions within the pipeline in order to reduce memory traffic. These operations may mutate the multiplication inputs and outputs inline, either before or after the multiplication at block 206. Stochastic gradient descent may be performed concurrently at circuit 212. If desired, certain operations can be selectively bypassed to allow configuring the pipeline for different training flow equations. Configured in this way, each pipeline stage can operate on a matrix tile using a double buffering scheme to pass the results onto the next pipeline stage. The pipeline is kept occupied by issuing operations on multiple matrix tiles through the pipeline.

As described above, matrix multiplication may be supported using systolic arrays. Systolic arrays for matrix multiplication include processing elements (PEs) and control logic for coordinating the PEs. FIG. 3 is a diagram of a systolic array processing element 300 in accordance with an embodiment. As shown in FIG. 3, a given row of PEs is loaded with matrix A using row feeder circuit 302, whereas a given column of PEs is loaded with matrix B using column feeder circuit 304. Processing element 300 may include hybrid floating-point dot-product circuitry 310 and also a local accumulation storage circuit 312 (e.g., an accumulation shift register) for temporarily holding intermediate results. Accumulated data may be selectively fed back as inputs to the dot-product circuitry 310 via multiplexer 314 (see accumulated input data ACC). Processing element 300 in the given row and the given column may be configured to interleave the computation of the elements it computes to accommodate the propagation latency of the dot product.

Arranged in this way, all PEs in the same row or column of the systolic array can share the same operand. A chip-wide distribution network may be used to provide data to all of the PEs in the systolic array. The matrix A row feeder circuit 302 may receive data originating from load A circuit 202 of FIG. 2, whereas the matrix B column feeder circuit 304 may receive data originating from load B circuit 204. The “hybrid” nature of the floating-point dot-product circuitry 310 stems from the usage of both hard floating-point multipliers (e.g., using DSP blocks) and soft floating-point multipliers. The latency through the soft floating-point multiplier portion will be greater than the latency through the hard floating-point multiplier portion.

This imbalance of arrival times is schematically represented by the L-shaped outline of circuitry 310, where the lower elements corresponding to the soft multiplier inputs are allowed to arrive sooner and where the upper elements corresponding to the hard multiplier inputs have to be delayed to account for the scheduling imbalance. In order to minimize the number of registers that have to be inserted into processing element 300, the input delay is implemented next to the feeder circuitry (e.g., using delay registers 303 at the output of feeder 302 and using delay registers 305 at the output of feeder 304) and preserved along the data bus by ensuring that identical delay increments are added to all bus paths. To ensure that both dot product operands arrive at the same time, the delays introduced by registers 303 along each row and by registers 305 along each column should be matched.

FIG. 4 is a diagram showing an illustrative matrix allocation to a systolic array. As shown in FIG. 4, the storage capacity of the feeder has to accommodate a tile of width burst. This is required to accommodate the smallest unit of off-chip DDR memory access that would utilize the memory bandwidth efficiently. A burst may be several memory words (e.g., 32 bytes, 64 bytes, 128 bytes, 256 bytes, 512 bytes, 1024 bytes, etc.).

The feeder memories should be populated with data words matching the orientation of the dot product operands, which allows the parallel feed of all input operands every clock cycle. A double buffering scheme may be employed, where one tile is loaded (a process that might take thousands of cycles for each matrix) while the previously loaded tile is broadcast on the row or column bus. By adjusting the number of rows and columns, the degree of data reuse can be modified, and the time it takes to load a new matrix tile into the feeder circuits and the time it takes to issue all row and column permutations in the PE array can be balanced.

The IEEE 754 single precision 32-bit floating-point format has traditionally been used for dot-product data paths. The single precision format has one sign bit, eight exponent bits, and 23 fraction bits (with an implied leading one bit to make up 24 bits of total mantissa precision). New research, however, seems to suggest that a lower bitwidth floating-point format would be a more suitable candidate for implementing dot-products due to its reduced memory bandwidth requirements, as long as the reduction tree is implemented in single precision. For example, a 16-bit format (sometimes referred to as “BFLOAT16”) having one sign bit, eight exponent bits (i.e., wE=8), and seven fraction bits (i.e., wF=7) might be used. The reduced 16-bit operands in BFLOAT16 allow for a potential 2× memory bandwidth improvement.
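The relationship between the two formats can be illustrated with a short software sketch. The helper names are hypothetical, and narrowing by simple truncation (rather than rounding) is an assumption made for brevity. Because both formats share the same 8-bit exponent, a BFLOAT16 value occupies the upper 16 bits of the corresponding single precision word, and widening back to single precision only requires zero padding the low 16 fraction bits.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 single precision encoding of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits):
    """Interpret a 32-bit integer as an IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bfloat16_bits(x):
    """Narrow to 16 bits by dropping the low 16 fraction bits (truncation is assumed)."""
    return float32_bits(x) >> 16


def bfloat16_to_float32(b16):
    """Widen to single precision by zero padding the low 16 fraction bits."""
    return bits_to_float32((b16 & 0xFFFF) << 16)


def bfloat16_fields(b16):
    """Split a 16-bit word into (sign, 8-bit exponent field, 7-bit fraction field)."""
    return (b16 >> 15) & 0x1, (b16 >> 7) & 0xFF, b16 & 0x7F


one = to_bfloat16_bits(1.0)        # 0x3F80
neg = to_bfloat16_bits(-2.5)       # 0xC020
```

Values such as 1.0 and -2.5 need no more than seven fraction bits, so they survive the round trip through the 16-bit format unchanged; values with longer fractions lose their low-order bits.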

Conventional FPGA designs are, however, heavily optimized for single precision (SP) arithmetic. For instance, single precision dot-product circuit blocks typically map directly to DSP blocks running at nominal frequencies. Thus, obtaining a higher than SP dot-product density for a BFLOAT16+SP dot-product is challenging. First, BFLOAT16 multipliers require a combination of DSP blocks and adaptive logic modules (ALMs) within LABs 110 (FIG. 1) to implement. In order to obtain better dot-product density than by just using DSP blocks, a custom floating-point pipeline architecture is provided.

In accordance with an embodiment, hybrid floating-point dot-product circuitry 310 that utilizes both hard DSP blocks and custom soft multiplier blocks is shown in FIG. 5A. This exemplary hybrid dot-product architecture 310 is configured to implement a 16-element dot-product. As shown in FIG. 5A, circuitry 310 may receive input operands A_i and B_i, both of which are in BFLOAT16 format, where i represents an index from 0 to 15. The rightmost accumulate (ACC) input may be in the single precision format.

The first 12 sets of input operands feed into six custom 2-element dot-product circuits 500. Each circuit 500 may be configured to generate a sum of two products and is sometimes referred to herein as a “dot2” circuit. In the example of FIG. 5A, circuit 500-1 may be configured to compute (A0*B0+A1*B1); circuit 500-2 may be configured to compute (A2*B2+A3*B3); circuit 500-3 may be configured to compute (A4*B4+A5*B5); ...; and circuit 500-6 may be configured to compute (A10*B10+A11*B11). An example of an individual dot2 circuit 500-1 is shown in FIG. 5B. As shown in FIG. 5B, dot2 circuit 500-1 includes a first multiplier 560 for computing A0*B0, a second multiplier 560 for computing A1*B1, and an adder for summing the results from the two multipliers 560. Referring back to FIG. 5A, the outputs from the various dot2 circuits 500 may be fed into an adder tree, which may include a first stage of adders 510-1, 510-2, 510-3, a second stage of adders 520-1 and 520-2, and a third adder stage 530.

A DSP block within an FPGA may be operated either in a floating-point mode or a fixed-point mode. In the fixed-point mode, the DSP block may be configured so that two 18×18 multipliers operate independently or configured into one larger 27×27 multiplier. The “hybrid” labeling of architecture 310 is due to the usage of both a “hard” data path (e.g., where the DSP blocks corresponding to the higher inputs A,B[12:15] are configured to operate in the floating-point mode) and a “soft” data path (e.g., where the DSP blocks corresponding to the lower inputs A,B[0:11] are configured to operate in the fixed-point mode in order to access the two 18×18 multipliers 502 independently).

As shown in FIG. 5A, one of the two 18×18 multipliers 502 may be used by circuit 500-1, whereas the second of the two 18×18 multipliers may be used by circuit 500-2. Each 18×18 multiplier may be further used to support two smaller 8×8 multiplications performed by multipliers 560 (see, e.g., FIG. 5B). From a resource utilization perspective, each dot2 circuit 500 uses half a DSP block and some general purpose soft logic, which may include various operations such as exponent add, exponent difference, integer add, alignment shifting, etc. A single 18×18 multiplier may only natively support two 6×6 multiplications, so additional soft logic is needed to support two 8×8 multiplications for BFLOAT16. Thus, each dot2 circuit 500 may also be considered a “hybrid” circuit since it uses a DSP block in fixed-point mode with soft logic to extend the support to BFLOAT16. Portion 590-1 of circuitry 310 that includes the dot2 circuits therefore corresponds to a hard and soft data path, whereas portion 590-2 of circuitry 310 that includes the DSP blocks operating in floating-point mode may therefore correspond to the hard data path.

The next two sets of input operands A,B[12:13] may be computed in parallel using two DSP blocks 120-1 and 120-2 configured in the floating-point mode. The conversion from the BFLOAT16 input format to the single precision floating-point format can be done by zero padding BFLOAT16's 7-bit mantissa. The output of DSP block 120-1 may merge into the adder tree via conversion circuit 512, and the result at the final stage of the adder tree may be normalized into an IEEE 754-like format. The remaining input operands may be computed using DSP blocks 120-3 and 120-4, also configured in the floating-point mode. Blocks 120-3 and 120-4 may collectively compute (A14*B14+(A15*B15+ACC)). The single-precision adder circuit 190 within DSP block 120-2 may be used to compute the final addition between the normalized result from the adder tree and the output from block 120-3 to calculate final output (A0*B0+A1*B1+A2*B2+ ... +A14*B14+A15*B15+ACC).
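The grouping of the 16 products can be sketched in software as follows. This sketch uses exact arithmetic and does not model truncation, format conversion, normalization, or latency. The function names are hypothetical, and the point at which the A13*B13 product joins the sum is an assumption, since the description above only states where the output of DSP block 120-1 merges into the adder tree.

```python
def dot2(a0, b0, a1, b1):
    """Sum of two products, as produced by one dot2 circuit 500."""
    return a0 * b0 + a1 * b1


def hybrid_dot16(a, b, acc):
    """Evaluate A0*B0 + ... + A15*B15 + ACC using the grouping of FIG. 5A."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("expected 16-element operand vectors")

    # Elements 0..11: six dot2 circuits 500-1 .. 500-6.
    d = [dot2(a[i], b[i], a[i + 1], b[i + 1]) for i in range(0, 12, 2)]

    # First adder stage (adders 510-1 .. 510-3) reduces six values to three.
    s1 = [d[0] + d[1], d[2] + d[3], d[4] + d[5]]

    # Elements 12 and 13: products from DSP blocks in floating-point mode.
    p12 = a[12] * b[12]
    p13 = a[13] * b[13]

    # Second adder stage (adders 520-1, 520-2); p12 merges into the tree here.
    s2 = [s1[0] + s1[1], s1[2] + p12]

    # Third adder stage 530.
    tree = s2[0] + s2[1]

    # Elements 14 and 15 plus the accumulator form a chain.
    chain = a[14] * b[14] + (a[15] * b[15] + acc)

    # Final addition. Adding p13 at this point is an assumption of this sketch.
    return tree + p13 + chain


a = list(range(16))
b = list(range(16, 32))
result = hybrid_dot16(a, b, acc=5)
reference = sum(x * y for x, y in zip(a, b)) + 5
```

With integer inputs the grouped evaluation and the straightforward reference sum agree exactly; with floating-point inputs the two may differ in the low-order bits because the additions are associated differently.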

The labeling on the left edge of FIG. 5A illustrates when the inputs associated with the different elements should be scheduled for arrival. Inputs A,B[0:13] can be scheduled to arrive in parallel at cycle #1. However, inputs A,B[14:15] should be delayed so that they arrive later, after cycle #14. This input scheduling imbalance again lends itself to the L-shaped logical representation of FIG. 3.

FIG. 6A is a diagram of a classical floating-point multiplier 600. The multiplier blocks 180 within the DSP blocks 120 in FIG. 5A may optionally be implemented using this classical multiplier architecture. Multiplier 600 is configured to receive a first floating-point input X having a sign bit Sx, an exponent Ex, and a mantissa Mx and a second floating-point input Y having a sign bit Sy, an exponent Ey, and a mantissa My. Exponents Ex and Ey have the same exponent width wE. Mantissas Mx and My have a precision that is equal to (1+wF) if accounting for the implied leading one bit in front of the radix point.

Multiplier 600 includes a logic XOR gate 610 that receives sign bits Sx and Sy and generates a corresponding output sign bit Sp for the resulting product. Multiplier 600 further includes a mantissa multiplier block 630, a mantissa normalization block 632, a mantissa sticky bit block 634, a round bit computing block 636, a mantissa rounding block 638, and a mantissa update block 640 for handling the mantissa of the resulting product. Mantissa multiplier block 630 multiplies Mx by My to obtain a mantissa product, which has up to (2+2*wF) bits.

Block 632 receives the top (2+wF+1) bits of the mantissa product and normalizes this value to the interval [1,2) by checking the most significant bit (MSB) of the mantissa product. If the MSB is equal to “1”, then block 632 performs a 1-position right shift. This MSB bit is also forwarded to the exponent update block 622. The bit shifted out during the 1-position right shift is forwarded to the Rnd block 636 together with the two least significant bits (LSBs) of the normalized mantissa product.

Block 634 computes sticky bits from the bottom (wF-1) bits of the mantissa product. The sticky bit is the logic OR'ed result of all of these bottom bits. Thus, if any of the (wF-1) bits are high, then the output of the sticky bit block 634 will be equal to “1”. The Rnd block 636 receives a partial sticky value from block 634 and the shifted-out value from block 632 to produce a final sticky value. The additional two LSBs forwarded from block 632 represent the mantissa LSB (i.e., T) and a rounding bit (R). Bits T, R, and the final sticky value are used collectively to produce a 1-bit “Rnd” signal that will be added to the LSB of the normalized mantissa at rounding block 638. Rounding block 638 is composed of one integer adder for adding the Rnd value computed by block 636 to the normalized fraction. This adder has wF bits and produces a carry-out signal, which is passed to the exponent update block 622.
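One common way to combine the T, R, and sticky values into a single Rnd bit is round-to-nearest with ties to even, sketched below on plain integers. The rounding mode is an assumption of this sketch, since the description above does not name one, and the function name and the `drop` parameter (the number of low-order bits being discarded) are hypothetical.

```python
def round_nearest_even(value, drop):
    """Discard the low `drop` bits of a non-negative integer, rounding to nearest even."""
    if drop < 1:
        raise ValueError("drop must be at least 1")
    kept = value >> drop                              # bits that survive
    t = kept & 1                                      # T: LSB of the kept value
    r = (value >> (drop - 1)) & 1                     # R: first discarded bit
    sticky = int(value & ((1 << (drop - 1)) - 1) != 0)  # OR of the remaining discarded bits
    rnd = r & (sticky | t)                            # assumed ties-to-even rule
    return kept + rnd                                 # the add performed by the rounding adder


tie_to_even_up = round_nearest_even(0b10110, 2)    # 5.5 rounds to 6
tie_to_even_down = round_nearest_even(0b10010, 2)  # 4.5 rounds to 4
above_half = round_nearest_even(0b10011, 2)        # 4.75 rounds to 5
below_half = round_nearest_even(0b10001, 2)        # 4.25 rounds to 4
```

The addition of `rnd` can carry out of the kept field (for example, when all kept bits are ones), which corresponds to the carry-out signal that is passed on to the exponent update logic.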

Mantissa update block 640 receives the overflow and underflow signals from block 624 and flushes the mantissa to zero if either an overflow or underflow has occurred. This is required since IEEE 754 has specific encodings for infinity and zero. The resulting final mantissa value Mp will have (1+wF) bits.

Multiplier 600 further includes an exponent addition circuit 620 for summing exponents Ex and Ey. Block 622 increments the sum of the exponents when (i) the mantissa product is greater than or equal to two or (ii) the resulting mantissa after rounding is greater than or equal to two. Block 624 checks that the final exponent is within the allowed bounds. For single precision and BFLOAT16, the maximum exponent is equal to 127 while the minimum exponent is equal to −126 since both formats use 8 exponent bits. If the exponent value is greater than 127, then the multiplier should return infinity (i.e., by returning a string of wE ones “11111111”). If the exponent value is less than −126, then the multiplier should return zero (i.e., by returning a string of wE zeros “00000000”).
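The bounds check can be sketched as a small function that maps an unbiased exponent to the 8-bit exponent field. The function name and the returned status string are hypothetical; the bias of 127 is the standard bias for 8-bit exponents in both single precision and BFLOAT16.

```python
W_E = 8          # exponent width shared by single precision and BFLOAT16
BIAS = 127       # standard exponent bias for an 8-bit exponent
E_MAX = 127      # largest unbiased exponent of a normal value
E_MIN = -126     # smallest unbiased exponent of a normal value


def encode_exponent(e):
    """Map an unbiased exponent to (exponent field, status)."""
    if e > E_MAX:
        return (1 << W_E) - 1, "overflow"   # all ones: encodes infinity
    if e < E_MIN:
        return 0, "underflow"               # all zeros: encodes zero
    return e + BIAS, "ok"


too_large = encode_exponent(128)
too_small = encode_exponent(-127)
in_range = encode_exponent(0)
```

When the status is not "ok", the mantissa would also be flushed to zero so that the overall result matches the infinity or zero encoding.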

Configured in this way, multiplier 600 performs a mantissa multiplication (at block 630), 1-bit normalization (at block 632), rounding (using blocks 634, 636, and 638) and overflow/underflow followed by exception handling.

FIG. 6B is a diagram of an illustrative customized floating-point multiplier 560 within the 2-element dot-product circuit shown in FIG. 5B in accordance with an embodiment. As shown in FIG. 6B, multiplier 560 may only include logic XOR gate 650 for computing the sign bit, an exponent adder circuit 652, a mantissa multiplier circuit 654 (which can be implemented using only one 18×18 multiplier within a DSP block), and a bit truncating circuit 656.

In contrast to the classical multiplier shown in FIG. 6A, custom multiplier 560 skips the normalization stage (but requires an extra overflow guard bit) and skips the rounding stage (but requires an additional mantissa bit). All (2+2wF) bits, which include the additional mantissa bit, are then fed to truncate block 656. Truncate block 656 will then truncate or discard all bits beyond the (2+w) most significant bits. The parameter “w” might be set equal to 8 (as an example) or some other value (e.g., w may be set equal to 6, 7, 9, 10, 5-12, 4-16, or some other suitable integer value) that can be adjusted to trade off resources for accuracy. Multiplier 560 also skips the overflow/underflow and exception handling by extending the exponent by 2 bits to include one sign bit and one overflow guard bit (e.g., the final exponent will have (2+wE) bits).
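The simplified data path can be sketched in software as follows. The function names are hypothetical, mantissas are modeled as (1+wF)-bit integers that include the implied leading one, and exponents are modeled as unbiased Python integers rather than as a (2+wE)-bit field; these modeling choices are assumptions of the sketch.

```python
def custom_multiply(sx, ex, mx, sy, ey, my, w_f=7, w=8):
    """Multiply without normalization, rounding, or exception handling.

    Returns (sign, exponent, mantissa), where mantissa is a fixed-point value
    with two integer bits and up to w fraction bits.
    """
    sign = sx ^ sy                         # XOR of the sign bits
    exponent = ex + ey                     # exponent add, no bounds check
    product = mx * my                      # full product: up to (2 + 2*w_f) bits
    total_bits = 2 + 2 * w_f
    keep = min(2 + w, total_bits)          # keep only the (2 + w) most significant bits
    mantissa = product >> (total_bits - keep)
    return sign, exponent, mantissa


def product_value(sign, exponent, mantissa, w_f=7, w=8):
    """Convert the unnormalized result back to a Python float for inspection."""
    frac_bits = min(w, 2 * w_f)
    magnitude = (mantissa / (1 << frac_bits)) * (2.0 ** exponent)
    return -magnitude if sign else magnitude


# 1.5 * 2**0 multiplied by -1.5 * 2**1; 1.5 is 0b11000000 with w_f = 7.
s, e, m = custom_multiply(0, 0, 0b11000000, 1, 1, 0b11000000)
value = product_value(s, e, m)
```

Here the mantissa product 2.25 lies in [2,4) and is left unnormalized, which is why the result needs the second integer bit that serves as the overflow guard.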

Compared to multiplier 600 of FIG. 6A, multiplier 560 of FIG. 6B provides significant area and power savings while offering comparable accuracy for the overall dot-product circuitry, which provides a tangible improvement to the underlying computer functionality when supporting machine learning processes.

FIG. 7A is a diagram of a classical floating-point adder 700. The adder blocks 190 within the DSP blocks 120 in FIG. 5A may optionally be implemented using this classical adder architecture. Adder 700 is configured to receive a first floating-point input X having a sign bit Sx, an exponent Ex, and a mantissa Mx and a second floating-point input Y having a sign bit Sy, an exponent Ey, and a mantissa My. Exponents Ex and Ey have the same exponent width wE. Mantissas Mx and My have a precision that is equal to (1+wF) if accounting for the implied leading one bit in front of the radix point.

Adder 700 includes a multiplexer 702, a logic XOR gate 704, an exponent difference block 706, a mantissa swap block 708, a two's complement block 710, an absolute value (ABS) block 712, an alignment shifter 714, an integer adder block 716, a sign-magnitude conversion block 718, a leading zero counter 720, a normalization shifter 722, a rounding block 724, an exponent update block 726, and a sign block 728. Logic XOR gate 704 simply computes the exclusive OR of Sx and Sy.

Block 706 computes the difference of Ex minus Ey. Multiplexer 702 outputs the maximum of the two exponents Ex and Ey. The select line of multiplexer 702 is driven by the sign bit (i.e., the MSB) of the difference of (Ex-Ey) computed by block 706. If the difference is negative (i.e., if the sign bit of Ex-Ey is “1”), then multiplexer 702 will forward Ey; otherwise, it will output Ex.

Mantissa swap block 708 selectively swaps the mantissas depending on whether the difference computed by block 706 is negative (i.e., a swap is required if Ex is smaller than Ey). The mantissa value corresponding to the smaller exponent will be converted to the two's complement using block 710 if the output of XOR block 704 is high (i.e., if the signs of X and Y are different). Block 712 calculates the absolute value of the exponent difference. For example, if the exponent difference is equal to −2, then the mantissa corresponding to the smaller exponent needs to be shifted by two bit positions using alignment shifter 714 with respect to the mantissa corresponding to the larger exponent.
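The exponent comparison, swap, and alignment steps can be sketched in software for two operands of the same sign. The function name is hypothetical, the two's complement step for operands of different signs is omitted, and bits shifted out during alignment are simply dropped; these simplifications are assumptions of the sketch.

```python
def align_operands(ex, mx, ey, my):
    """Return (larger exponent, mantissa of larger exponent, aligned other mantissa)."""
    diff = ex - ey                            # exponent difference (block 706)
    swap = diff < 0                           # sign bit of the difference
    e_max = ey if swap else ex                # multiplexer 702 selects the larger exponent
    m_large, m_small = (my, mx) if swap else (mx, my)  # mantissa swap (block 708)
    shift = abs(diff)                         # absolute value (block 712)
    m_small_aligned = m_small >> shift        # alignment shifter 714
    return e_max, m_large, m_small_aligned


# Exponent difference of -2: the first operand's mantissa is shifted by two positions.
aligned = align_operands(3, 0b11000000, 5, 0b10100000)
# Same operands in the opposite order give the same aligned pair.
aligned_swapped = align_operands(5, 0b10100000, 3, 0b11000000)
```

After alignment, both mantissas are expressed relative to the larger exponent, so they can be summed by an ordinary integer adder.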

The mantissa corresponding to the larger exponent and the mantissa corresponding to the smaller exponent, after being aligned by shifter 714, are then summed together by integer adder 716. The output of adder 716 is then converted to sign-magnitude format using block 718. Leading zero counter 720 determines the number of leading zeros in the converted sign-magnitude value. Normalization shifter 722 then normalizes the sign-magnitude value by shifting that value left based on the number of leading zeros determined by counter 720. This normalized mantissa may then be rounded by block 724 to output the final mantissa of the sum (Ms).

Sign block 728 may output the final sign bit of the sum (Ss) based on Sx, Sy, the output of XOR gate 704, the exponent difference, and also the output of integer adder 716. Exponent update block 726 receives the larger exponent value from multiplexer 702, the leading zero count value from counter 720, and also the carry-out bit from rounding block 724. If the carry-out of rounding block 724 is a “1”, then the larger exponent received at block 726 from multiplexer 702 will be incremented by one. Otherwise, if the leading zero count is “0” (indicating that the sum of the two mantissas is greater than or equal to two), then the received exponent will be incremented by one. If the leading zero count is “1”, then the received exponent is not updated. If the leading zero count is c, where c is greater than one, then the value (c-1) will be subtracted from the received exponent to generate final exponent Es. For simplicity, the overflow/underflow block that checks the bounds of the exponents is omitted from FIG. 7A.
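
As a rough illustration, the exponent update rule of block 726 reduces to one expression: when there is no rounding carry, the exponent changes by (1 - c), where c is the leading zero count. A count of zero means the mantissa sum reached two or more, so the exponent grows by one. The function name below is hypothetical and bounds checking is omitted, as in FIG. 7A.

```python
def update_exponent(e_max, lzc, round_carry):
    """Sketch of exponent update block 726 (no overflow/underflow check)."""
    if round_carry:
        return e_max + 1        # rounding carried out of the mantissa
    return e_max + 1 - lzc      # lzc=0: +1, lzc=1: unchanged, lzc=c: -(c-1)


print([update_exponent(10, c, False) for c in range(5)])
```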

FIG. 7B is a diagram of an illustrative customized floating-point adder 562 within the 2-element dot-product circuit shown in FIG. 5B in accordance with an embodiment. Floating-point adder 562 may therefore sometimes be referred to as the dot-2 adder. As shown in FIG. 7B, adder 562 may only include multiplexer 730, exponent difference circuit 732, mantissa swap circuit 734, a first two's complement circuit 736-1, a second two's complement circuit 736-2, an absolute value (ABS) circuit 738, an alignment shifter circuit 740, an integer adder 742, and a truncation circuit 744.

In contrast to the classical adder shown in FIG. 7A, custom adder 562 receives products in a customized format from the output of multiplier 560 shown in FIG. 6B. As described above in connection with FIG. 6B, each of the arriving product signals may have a 1-bit sign field, an exponent field represented by (2+wE) bits, and an un-normalized mantissa field that requires (2+w) bits. Adder 562 is “custom” in the sense that it operates on this non-standard input format.

Circuits 736-1 and 736-2 convert the un-normalized mantissas into their two's complement equivalent. After alignment by shifter 740, the fixed-point sum of the two mantissas is computed at block 742. The right shifter 740 is less costly than alignment shifter 714 of circuit 700 since it does not need to compute the sticky bits typically required for rounding. The rounding-to-nearest step is also skipped and is replaced by truncation block 744, which truncates the fractional portion to wA bits and discards all bits beyond the wA positions to the right of the radix point. Adjustable parameter wA therefore dictates the position of the truncation. The parameter “wA” might be set equal to 8 (as an example) or some other value (e.g., wA may be set equal to 6, 7, 9, 10, 5-12, 4-16, or some other suitable integer value) that can be adjusted to trade off resource utilization for accuracy. Adder 562 itself may output signals in yet another custom format composed of an exponent field Es with (2+wE) bits and a mantissa field Ms with (4+wA) bits. The resulting mantissa Ms will be in the two's complement format, so no extra sign bit is required at the output of adder 562.
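
The effect of truncation block 744 can be pictured with a small fixed-point model. The sketch below assumes the mantissa is held as a Python integer in two's complement with `frac_bits` fractional bits; the function name and that representation are illustrative only.

```python
def truncate_mantissa(m, frac_bits, wA):
    """Keep only wA fractional bits, discarding everything to the right.

    Python's >> on integers is an arithmetic shift, so negative two's
    complement values are floored, which is what plain bit dropping does.
    """
    if wA >= frac_bits:
        return m << (wA - frac_bits)    # nothing to discard, pad with zeros
    return m >> (frac_bits - wA)


# 0b10.110111 (6 fractional bits) truncated to 2 fractional bits is 0b10.11
print(bin(truncate_mantissa(0b10110111, 6, 2)))
```

No sticky bit or rounding increment is computed, so the worst-case error is just under one unit in the last retained position; a larger wA lowers that error at the cost of wider datapaths.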

Compared to adder 700 of FIG. 7A, custom adder 562 of FIG. 7B provides significant area and power savings while offering comparable accuracy for the overall dot-product circuitry, which provides a tangible improvement to the underlying computer functionality when supporting machine learning processes. Custom adder 562 provides these improvements by directly outputting the exponent of the sum without an exponent update circuit, by generating the mantissa of the sum without a sign-magnitude converter, without a leading zero counter, without a normalization shifter, and without a rounding circuit.

FIGS. 7C, 7D, and 7E illustrate suitable implementations for the custom floating-point adders in the adder tree (see, e.g., adders 510, 520, and 530 in FIG. 5A). The adders in the adder tree are “customized” in the sense that they are configured to receive inputs having the custom numerical format output by adder 562 of FIG. 7B.

FIG. 7C is a diagram of customized floating-point adder 510 in the first adder stage of hybrid floating-point dot-product circuitry 310 (see, e.g., adders 510-1, 510-2, and 510-3 in FIG. 5A). As shown in FIG. 7C, adder 510 may include a multiplexer 750 (having similar structure and function as multiplexer 730 of FIG. 7B), an exponent difference circuit 752 (having similar structure and function as block 732), mantissa swapping circuit 754 (having similar structure and function as block 734), absolute value circuit 756 (having similar structure and function as block 738), alignment shifting circuit 758 (having similar structure and function as block 740), integer adder 760 (corresponding to adder block 742), and truncation circuit 762 (corresponding to block 744).

Compared to the dot-2 adder 562, adder 510 is less complex since the conversion from sign-magnitude to two's complement is no longer required (i.e., adder 510 does not include any two's complement converter). Note that the output of integer adder 760 has 5 bits in front of the radix point, with one extra MSB to prevent overflow. After truncation at block 762, the resulting mantissa will have (5+wA+1) bits, with another extra LSB to optionally improve accuracy with the truncation. In other words, the mantissa width may increase by two bits at the first adder level.

FIG. 7D is a diagram of customized floating-point adder 520 in the second adder stage of hybrid floating-point dot-product circuitry 310 (see, e.g., adders 520-1 and 520-2 in FIG. 5A). Adder 520 has a substantially similar structure as adder 510, except the mantissa swap circuit 754′ and the mantissa alignment shifter 758′ now operate on (5+wA+1) bits. The output of integer adder 760′ now has 6 bits in front of the radix point, with another extra MSB to prevent overflow. After truncation at block 762′, the resulting mantissa will have (6+wA+2) bits, with another extra LSB to optionally improve accuracy with the truncation. In other words, the mantissa width may increase by another two bits at the second adder level.

FIG. 7E is a diagram of customized floating-point adder 530 in the third adder stage of hybrid floating-point dot-product circuitry 310 (see, e.g., adder 530 in FIG. 5A). Adder 530 has a substantially similar structure as adders 510 and 520, except the mantissa swap circuit 754″ and the mantissa alignment shifter 758″ now operate on (6+wA+2) bits. The output of integer adder 760″ will now have 7 bits in front of the radix point, with the additional MSB to prevent overflow. After truncation at block 762″, the resulting mantissa will have (7+wA+3) bits, with another extra LSB to optionally improve accuracy with the truncation. In other words, the mantissa width may increase by yet another two bits at the third adder level.
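
The mantissa widths quoted for FIGS. 7B through 7E follow a simple pattern: each adder stage adds one bit on each side of the radix point. The short sketch below tabulates that pattern for a chosen wA; the function name and dictionary keys are hypothetical labels, not terms from the figures.

```python
def mantissa_widths(wA=8):
    """Total mantissa bits at the output of the dot-2 adder and each tree stage."""
    widths = {"dot2": 4 + wA}                   # adder 562: (4+wA)
    for stage in (1, 2, 3):                     # adders 510, 520, 530
        integer_bits = 4 + stage                # one extra MSB per stage
        fraction_bits = wA + stage              # one extra LSB per stage
        widths[f"stage{stage}"] = integer_bits + fraction_bits
    return widths


print(mantissa_widths(8))
```

With wA equal to 8, the widths come out as 12, 14, 16, and 18 bits, matching (4+wA), (5+wA+1), (6+wA+2), and (7+wA+3).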

Referring briefly back to FIG. 5A, the output of DSP block 120-1 is in the single precision format and needs to be converted to the custom format using conversion circuit 512 prior to being merged at adder 520-2 in the second level of the adder tree.

FIG. 8 is a diagram showing one suitable implementation of such floating-point format converter 512. As shown in FIG. 8, conversion circuit 512 may include an exponent subtraction circuit 802, a circuit 804 for converting the input mantissa from (1+wF) bit width into its two's complement equivalent with (2+wF) bit width, and a bit selection circuit 806 for selecting only the top (5+wA+1) bits from the output of block 804. The remaining bits to the right of the (wA+1) fractional bits may be discarded via truncation or can optionally be rounded to nearest (e.g., by adding “1” to the bit to the immediate right of the (wA+1) bit and then truncating the result). Circuit 806 that performs truncation or rounding is sometimes referred to as a bit reduction circuit. The resulting converted mantissa Mc will have 5 bits to the left of the radix point and (wA+1) bits to the right of the radix point.

The single precision mantissa has to be aligned to the left since the custom format at the input of the second adder stage requires 5 bits to the left of the radix point. This is accomplished by shifting the mantissa to the left by 3 bit positions, which also allows more mantissa bits to be kept for accuracy. To compensate for this shifting, exponent subtraction block 802 may be configured to subtract 3 from the input exponent value, where the converted exponent output Ec will have (2+wE) bits. The format at the output of the converter is (2+wE) exponent bits and (5+wA+1) mantissa bits, which matches the numerical format output by adder 510 in the first adder stage and received at adder 520 in the second adder stage.
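
A minimal software sketch of the conversion performed by circuit 512 is given below. It assumes the truncating variant of block 806, treats the exponent as a plain (unbiased) integer, and uses hypothetical function names; it is meant to show how the bit selection and the subtraction of 3 cancel out, not to describe the actual hardware.

```python
def convert_single_to_custom(sign, exp, frac, wA=8, wF=23):
    """Model of converter 512: returns (Ec, Mc) with Mc in two's complement."""
    magnitude = (1 << wF) | frac                  # restore the implied leading one
    signed = -magnitude if sign else magnitude    # block 804: (2+wF)-bit two's complement
    drop = (2 + wF) - (5 + wA + 1)                # bits discarded by block 806
    mc = signed >> drop                           # keep the top (5+wA+1) bits
    ec = exp - 3                                  # block 802 compensates the 3-bit left shift
    return ec, mc


def custom_value(ec, mc, wA=8):
    """Interpret Mc as having (wA+1) fractional bits."""
    return mc / (1 << (wA + 1)) * 2.0 ** ec


ec, mc = convert_single_to_custom(0, 4, 1 << 22)  # 1.5 * 2**4 = 24
print(ec, mc, custom_value(ec, mc))
```

Reading the top (5+wA+1) bits of the (2+wF)-bit value as a number with 5 integer bits is what moves the radix point by three positions, which is why the exponent is reduced by 3.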

Referring briefly back again to FIG. 5A, the output of third adder stage 530 is in the custom format with (7+wA+3) mantissa bits (see FIG. 7E), and thus needs to be normalized using normalization circuit 540 prior to being combined with another single precision value at adder 190 of DSP block 120-2. FIG. 9 is a diagram of normalization circuit 540 in accordance with an embodiment. As shown in FIG. 9, normalization circuit 540 may include at least a sign-magnitude converting block 902, a leading zero counter 904, a normalization shifter 908, an exponent update block 906, an overflow/underflow exponent handling block 910, and an overflow/underflow mantissa handling and right zero padding block 912.

Since the IEEE 754 single precision format adopts a sign-magnitude representation for the mantissa, block 902 may be configured to convert the incoming mantissa with (7+wA+3) bits into the sign-magnitude format. Counter 904 may be configured to identify the number of leading zeros in the converted mantissa. Normalization shifter 908 may then shift the converted sign-magnitude value based on the number of leading zeros determined by counter 904. Block 906 may be configured to update the exponent by incrementing Ex by (6-c), where “c” denotes the number of leading zeros identified by counter 904.

Block 910 checks that the updated exponent is within the allowed bounds. For single precision and BFLOAT16, the maximum exponent is equal to 127 while the minimum exponent is equal to −126 since both formats use 8 exponent bits. If the updated exponent value is greater than 127, then normalization circuit 540 should return infinity. If the updated exponent value is less than −126, then the normalization circuit 540 should return zero. The overflow/underflow information may then be forwarded to block 912 for flushing the mantissas to all zeros if either overflow or underflow occurs. Block 912 may also handle right zero padding to make up for a total of 23 fraction bits for single precision.
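
The normalization flow of FIG. 9 can be approximated in a few lines of software. The sketch below is a simplified model with several assumptions that are not taken from the figures: the exponent is an unbiased integer, infinity and zero are signalled by returning exponents 128 and −127 (the values that map to all-ones and all-zeros exponent fields under a bias of 127), and an all-zero input mantissa is treated as zero.

```python
def normalize_to_single(ex, m, wA=8, e_min=-126, e_max=127, frac_out=23):
    """Model of circuit 540: returns (sign, exponent, fraction)."""
    width = 7 + wA + 3                          # incoming mantissa width
    sign = 1 if m < 0 else 0
    mag = -m if m < 0 else m                    # block 902: sign-magnitude
    if mag == 0:
        return sign, e_min - 1, 0               # assumption: zero input gives zero
    c = width - mag.bit_length()                # counter 904: leading zeros
    e = ex + (6 - c)                            # block 906: exponent update
    if e > e_max:
        return sign, e_max + 1, 0               # block 910/912: overflow, infinity
    if e < e_min:
        return sign, e_min - 1, 0               # block 910/912: underflow, zero
    norm = mag << c                             # shifter 908: leading one at the MSB
    frac = norm & ((1 << (width - 1)) - 1)      # drop the now-implicit leading one
    pad = frac_out - (width - 1)
    frac = frac << pad if pad >= 0 else frac >> -pad   # block 912: right zero padding
    return sign, e, frac


# 24.0 held with 11 fractional bits (wA=8) and exponent 0 becomes 1.5 * 2**4.
print(normalize_to_single(0, 24 << 11))
```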

Some of the more costly components in terms of ALM resource usage within the hybrid dot-product circuitry 310 are the alignment shifters such as alignment shifter block 740 within floating-point adder of the type shown in FIG. 7B, alignment shifter 758 in the first adder stage of the type shown in FIG. 7C, alignment shifter 758′ in the second adder stage of the type shown in FIG. 7D, alignment shifter 758″ in the third adder stage of the type shown in FIG. 7E, and normalization shifter 908 in the normalization circuit 540 of the type shown in FIG. 9. Thus, an efficient implementation of these shifters is key for optimizing the efficiency of the entire machine learning system.

FIG. 10A is a diagram of a conventional barrel shifter 1000 for shifting bits left. As shown in FIG. 10A, barrel shifter 1000 receives an 8-bit input A[0:7] and control bits S[0:1] for controlling the amount of shifting performed. Barrel shifter 1000 includes: a first 4:1 multiplexer 1002-0 that receives A0 at its “0” input while the remaining inputs receive “0”; a second 4:1 multiplexer 1002-1 that receives A1 at its “0” input, A0 at its “1” input, and zeros at the remaining inputs; a third 4:1 multiplexer 1002-2 that receives A2 at its “0” input, A1 at its “1” input, A0 at its “2” input, and zero at its “3” input; a fourth 4:1 multiplexer 1002-3 that receives A3 at its “0” input, A2 at its “1” input, A1 at its “2” input, and A0 at its “3” input; a fifth 4:1 multiplexer 1002-4 that receives A4 at its “0” input, A3 at its “1” input, A2 at its “2” input, and A1 at its “3” input; a sixth 4:1 multiplexer 1002-5 that receives A5 at its “0” input, A4 at its “1” input, A3 at its “2” input, and A2 at its “3” input; a seventh 4:1 multiplexer 1002-6 that receives A6 at its “0” input, A5 at its “1” input, A4 at its “2” input, and A3 at its “3” input; and an eighth 4:1 multiplexer 1002-7 that receives A7 at its “0” input, A6 at its “1” input, A5 at its “2” input, and A4 at its “3” input.

Multiplexers 1002-0, 1002-1, 1002-2, 1002-3, 1002-4, 1002-5, 1002-6, and 1002-7 are used to generate Z[0:7], which represents the shifted output. In this arrangement, every 4:1 multiplexer 1002 implemented on an FPGA would require using a 6-input lookup table (LUT) circuit. As a result, a conventional implementation of a barrel shifter of N data bits and two control bits will require at least N 6-input LUTs.
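
Behaviorally, the multiplexer arrangement of FIG. 10A amounts to Z[k] = A[k − s] with zeros shifted in. The sketch below models that behavior; it assumes bit index 0 is the least significant bit and that S0 is the low-order select bit, neither of which is stated explicitly for FIG. 10A, and the function name is hypothetical.

```python
def barrel_shift_left(a, s0, s1):
    """Behavioral model of shifter 1000: one 4:1 multiplexer per output bit."""
    shift = s0 + 2 * s1                          # assumed encoding of S[0:1]
    # Multiplexer k selects A[k - shift], or a constant zero below bit 0.
    return [a[k - shift] if k >= shift else 0 for k in range(len(a))]


print(barrel_shift_left([1, 0, 1, 1, 0, 0, 0, 0], 1, 0))
```

Each output bit depends on four data inputs and two select inputs, which is why every multiplexer consumes a full 6-input LUT.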

In accordance with another suitable embodiment, a carry-chain based barrel shifter 1010 is shown in FIG. 10B. In contrast to the conventional barrel shifter of FIG. 10A (which uses a combinatorial structure relying on 4:1 multiplexers), the carry-chain based shifter 1010 uses a series of arithmetic cells 1012. Using an arithmetic based architecture instead of a combinatorial structure results in a smaller overall shifter structure that uses fewer wires, thereby improving area, power, cost, and performance.

As shown in FIG. 10B, carry-chain based barrel shifter 1010 may be configured to receive an 8-bit input A[0:7] and control bits S[0:1] for controlling the amount of shifting performed. The example of FIG. 10B in which shifter 1010 receives only eight input bits and two control bits is merely illustrative and is not intended to limit the scope of the present embodiments. If desired, the carry-chain based shifter architecture of FIG. 10B may be applied to shifters of any suitable size.

Shifter 1010 may include arithmetic cells 1012-0, 1012-1, 1012-2, 1012-3, 1012-4, 1012-5, 1012-6, and 1012-7. Each arithmetic cell 1012 with index k receives both control bits S[0:1] and two data bits A[k] and A[k−2] spaced two bit positions apart. Each arithmetic cell 1012 may include a first 4-input LUT 1020 and a second 4-input LUT 1022, each of which has four input ports a, b, c, d. The first 4-input LUT 1020 may be configured to compute (!a&!b&c OR !a&b&d), where “!” represents the “not” function. The second 4-input LUT 1022 may be configured to compute (a&!b&c OR a&b&d).

Each arithmetic cell 1012 may further include a logic XOR gate 1024 having a first input that receives a carry-out from the previous arithmetic cell in the chain (e.g., the first arithmetic cell will receive a carry-in of “0”), a second input that receives the output of LUT 1020, and an output on which a corresponding shifted output bit is generated. Each arithmetic cell 1012 may also include a simple 2:1 multiplexer 1026 having a first (0) input that receives the output of LUT 1022, a second (1) input that receives the carry-out from the previous arithmetic cell, and an output on which a corresponding carry-out is fed to the succeeding arithmetic cell in the chain.

The input connections of each arithmetic cell 1012 are illustrated in detail in FIG. 10B. The a, b, c, and d input ports of LUTs 1020 and 1022 in the first arithmetic cell 1012-0 may be configured to receive input bits S0, S1, A0, and 0, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the second arithmetic cell 1012-1 may be configured to receive input bits S0, S1, A1, and 0, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the third arithmetic cell 1012-2 may be configured to receive input bits S0, S1, A2, and A0, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the fourth arithmetic cell 1012-3 may be configured to receive input bits S0, S1, A3, and A1, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the fifth arithmetic cell 1012-4 may be configured to receive input bits S0, S1, A4, and A2, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the sixth arithmetic cell 1012-5 may be configured to receive input bits S0, S1, A5, and A3, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the seventh arithmetic cell 1012-6 may be configured to receive input bits S0, S1, A6, and A4, respectively. The a, b, c, and d input ports of LUTs 1020 and 1022 in the eighth arithmetic cell 1012-7 may be configured to receive input bits S0, S1, A7, and A5, respectively.

Arranged in this way, each arithmetic cell 1012 may be configured to perform the following operation. If S[0:1] is equal to “00”, then Z=A[k] and the carry-out is zero. If S[0:1] is equal to “01”, then Z=A[k−2] and the carry-out is zero. In either of these cases, the carry chain is not activated and the received input data is routed directly to the output of the same arithmetic cell.

If S[0:1] is equal to “10”, then Z is equal to the received carry-in and A[k] is routed to the carry-out. If S[0:1] is equal to “11”, then Z is again equal to the received carry-in and A[k−2] is routed to the carry-out. In either of these cases, the output data of that arithmetic cell is routed from the previous neighboring cell, and the carry chain is activated to route the received data input to the next succeeding cell in the chain.
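
The cell behavior described above can be checked with a small bit-level simulation. The sketch below uses the two LUT functions as given for LUTs 1020 and 1022 and assumes that multiplexer 1026 is selected by the output of LUT 1020, as stated in Example 52; the function names and the list-of-bits representation (index 0 first) are illustrative choices.

```python
from itertools import product


def lut_1020(a, b, c, d):
    return ((1 - a) & (1 - b) & c) | ((1 - a) & b & d)    # !a&!b&c OR !a&b&d


def lut_1022(a, b, c, d):
    return (a & (1 - b) & c) | (a & b & d)                # a&!b&c OR a&b&d


def carry_chain_shift(bits, s0, s1):
    """Simulate arithmetic cells 1012-0 .. 1012-(N-1) connected in a chain."""
    carry, out = 0, []                          # first cell sees a carry-in of 0
    for k, a_k in enumerate(bits):
        a_k2 = bits[k - 2] if k >= 2 else 0     # A[k-2], zero for the first two cells
        p = lut_1020(s0, s1, a_k, a_k2)
        g = lut_1022(s0, s1, a_k, a_k2)
        out.append(p ^ carry)                   # XOR gate 1024 produces Z[k]
        carry = carry if p else g               # mux 1026: input 1 is carry-in, input 0 is LUT 1022
    return out


# Compare against a plain shift by (S0 + 2*S1) for every 8-bit input.
for bits in product((0, 1), repeat=8):
    for s0, s1 in product((0, 1), repeat=2):
        n = s0 + 2 * s1
        expected = [bits[k - n] if k >= n else 0 for k in range(8)]
        assert carry_chain_shift(list(bits), s0, s1) == expected
print("carry-chain model matches a shift by S0 + 2*S1")
```

Under these assumptions the chain reproduces shifts of 0, 2, 1, and 3 positions for S[0:1] equal to “00”, “01”, “10”, and “11”, respectively, so the one-position shift comes from the carry chain and the two-position shift from the A[k−2] data input.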

Compared to the implementation of FIG. 10A that uses N instances of 6-input LUTs, the architecture of FIG. 10B uses 2N 4-input LUTs (16 in this example). A 6-input LUT is, however, 4× bigger in size than a 4-input LUT. As a result, the shifter configuration of FIG. 10B (and FIG. 10C) will occupy half the area of the conventional barrel shifter, which reduces cost and power.
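
The area comparison can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that LUT area scales with the number of truth-table bits (2^k for a k-input LUT), which gives the 4× ratio between 6-input and 4-input LUTs; routing and carry-chain logic are ignored.

```python
def lut_bits(k):
    """Truth-table size of a k-input LUT (assumed proxy for area)."""
    return 2 ** k


def area_ratio(n):
    """Carry-chain shifter area relative to the conventional shifter for N data bits."""
    carry_chain = 2 * n * lut_bits(4)       # 2N 4-input LUTs (FIG. 10B)
    conventional = n * lut_bits(6)          # N 6-input LUTs (FIG. 10A)
    return carry_chain / conventional


print(area_ratio(8))
```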

In the arrangement of FIG. 10B, the logic XOR gates 1024 and the 2:1 multiplexers 1026 in each arithmetic cell 1012 are connected in series and make up the carry chain 1030. FIG. 10C illustrates another suitable arrangement that is similar to the architecture of FIG. 10B, but the carry chain 1060 is implemented using adders 1050 connected in a chain. The first adder 1050 in the chain will also receive a carry-in of “0”. The adder-chain based barrel shifter of FIG. 10C can also provide substantial area and cost savings relative to the conventional barrel shifter of FIG. 10A. The improved shifting circuits of FIGS. 10B and 10C are not limited to use in machine learning training circuitry. If desired, carry-chain based shifters may be included in any type of arithmetic or compute system.

EXAMPLES

The following examples pertain to further embodiments.

Example 1 is an integrated circuit, comprising: first digital signal processing (DSP) blocks configured to operate in a floating-point mode; second digital signal processing (DSP) blocks configured to operate in a fixed-point mode that is different than the floating-point mode; and an adder configured to receive a first signal from the first DSP blocks operating in the floating-point mode and a second signal from the second DSP blocks operating in the fixed-point mode.

Example 2 is the integrated circuit of example 1, wherein the first DSP blocks are optionally part of a hard data path, and wherein the second DSP blocks are optionally part of a hard and soft data path.

Example 3 is the integrated circuit of any one of examples 1-2, wherein the first and second DSP blocks are optionally configured to receive input signals of a first floating-point format, and wherein the first DSP blocks are optionally configured to output signals in a second floating-point format that is different than the first floating-point format.

Example 4 is the integrated circuit of example 3, wherein the first floating-point format is optionally a BFLOAT16 format having one sign bit, eight exponent bits, and at most seven fraction bits.

Example 5 is the integrated circuit of any one of examples 3-4, wherein the second floating-point format is optionally a single-precision format having one sign bit, eight exponent bits, and twenty-three fraction bits.

Example 6 is the integrated circuit of any one of examples 3-5, wherein the second DSP blocks are optionally configured to output signals in a third floating-point format that is different than the first and second floating-point formats.

Example 7 is the integrated circuit of example 6, wherein the third floating-point format optionally has more exponent bits than the first floating-point format.

Example 8 is the integrated circuit of any one of examples 6-7, wherein the third floating-point format optionally has an adjustable number of fraction bits that determines the amount of truncation for the third floating-point format.

Example 9 is the integrated circuit of any one of examples 6-8, optionally further comprising a format conversion circuit configured to convert signals from the second floating-point format to the third floating-point format.

Example 10 is the integrated circuit of any one of examples 6-9, wherein the second DSP blocks optionally rely on soft logic to support outputting the signals in the third floating-point format.

Example 11 is the integrated circuit of example 10, optionally further comprising first adder circuits configured to receive the signals from the second DSP blocks and to output signals in a fourth floating-point format that is different than the third floating-point format.

Example 12 is the integrated circuit of example 11, optionally further comprising an adder tree configured to receive signals from the first adder circuits.

Example 13 is the integrated circuit of example 12, wherein the adder tree optionally comprises a first adder stage configured to output signals in a fifth floating-point format that is different than the fourth floating-point format.

Example 14 is the integrated circuit of example 13, wherein the adder tree optionally comprises a second adder stage configured to output signals in a sixth floating-point format that is different than the fifth floating-point format.

Example 15 is the integrated circuit of example 14, wherein the adder tree optionally comprises a third adder stage configured to output signals in a seventh floating-point format that is different than the sixth floating-point format.

Example 16 is the integrated circuit of example 15, optionally further comprising a normalization circuit configured to receive signals from the adder tree and to convert signals from the seventh floating-point format to the second floating-point format.

Example 17 is hybrid floating-point arithmetic circuitry, comprising: a first portion that includes only hard circuit blocks; a second portion that includes both hard and soft circuits; and an adder in the first portion, wherein the adder is configured to receive a first signal from the first portion and to receive a second signal from the second portion.

Example 18 is the hybrid floating-point arithmetic circuitry of example 17, wherein the hard circuit blocks in the first portion optionally comprise first digital signal processing (DSP) blocks operating in floating-point mode, and wherein the hard circuits in the second portion optionally comprise second digital signal processing (DSP) blocks operating in a fixed-point mode that is different than the floating-point mode.

Example 19 is the hybrid floating-point arithmetic circuitry of any one of examples 17-18, wherein the second portion is optionally configured to receive input signals from a feeder circuit, and wherein the first portion is optionally configured to receive input signals from the feeder circuit via a plurality of input delay registers to account for latency imbalance between the first and second portions.

Example 20 is hybrid floating-point dot-product circuitry, comprising: a hard data path that includes digital signal processing (DSP) blocks configured in a floating-point mode; a hard and soft data path that includes soft logic and digital signal processing (DSP) blocks configured in a fixed-point mode; an adder configured to receive signals from the hard data path and the hard and soft data path; and an accumulation storage circuit configured to receive signals from the adder, wherein an additional adder in the hard data path is configured to receive an accumulation signal from the accumulation storage circuit via a feedback path.

Example 21 is circuitry, comprising: a two-element dot-product circuit configured to receive first, second, third, and fourth inputs, to generate a first intermediate product from the first and second inputs, to generate a second intermediate product from the third and fourth inputs, and to compute a sum of the first and second intermediate products, wherein the two-element dot-product circuit comprises a first multiplier that generates the first intermediate product and a second multiplier that generates the second intermediate product, and wherein the first multiplier comprises: an exponent adder circuit configured to add the exponent of the first input and the exponent of the second input, wherein the exponent adder circuit is configured to directly generate the exponent of the first intermediate product.

Example 22 is the circuitry of example 21, wherein the first multiplier optionally directly generates the exponent of the first intermediate product without an exponent update circuit.

Example 23 is the circuitry of any one of examples 21-22, wherein the first multiplier optionally directly generates the exponent of the first intermediate product without an overflow and underflow checking circuit.

Example 24 is the circuitry of any one of examples 21-23, wherein the first multiplier optionally further comprises: a mantissa multiplier circuit configured to multiply the mantissa of the first input and the mantissa of the second input; and a bit truncation circuit configured to receive signals directly from the mantissa multiplier circuit and to directly generate the mantissa of the first intermediate product.

Example 25 is the circuitry of example 24, wherein the bit truncation circuit is optionally configured to perform an adjustable amount of mantissa truncation to balance resource usage with accuracy.

Example 26 is the circuitry of any one of examples 24-25, wherein the first multiplier optionally directly generates the mantissa of the first intermediate product without a normalization circuit.

Example 27 is the circuitry of any one of examples 24-26, wherein the first multiplier optionally directly generates the mantissa of the first intermediate product without a rounding circuit.

Example 28 is the circuitry of any one of examples 21-27, wherein the two-element dot-product circuit optionally further includes an adder circuit configured to compute the sum of the first and second intermediate products, and wherein the adder circuit optionally comprises: an exponent multiplexing circuit configured to select either the exponent of the first intermediate product or the exponent of the second intermediate product, wherein the exponent multiplexing circuit is further configured to directly generate the exponent of the sum.

Example 29 is the circuitry of example 28, wherein the adder circuit optionally directly outputs the exponent of the sum without an exponent update circuit.

Example 30 is the circuitry of any one of examples 28-29, wherein the adder circuit optionally further comprises: a mantissa swapping circuit having a first output and a second output; a first two's complement conversion circuit configured to receive a first mantissa value from the first output of the mantissa swapping circuit; and a second two's complement conversion circuit configured to receive a second mantissa value from the second output of the mantissa swapping circuit.

Example 31 is the circuitry of any one of examples 28-30, wherein the adder circuit optionally generates the mantissa of the sum without a sign-magnitude converter.

Example 32 is the circuitry of any one of examples 28-31, wherein the adder circuit optionally generates the mantissa of the sum without a leading zero counter and without a normalization shifter.

Example 33 is the circuitry of any one of examples 28-32, wherein the adder circuit optionally generates the mantissa of the sum without a rounding circuit.

Example 34 is the circuitry of any one of examples 28-33, wherein the adder circuit optionally further comprises: an integer adder; and a bit truncation circuit configured to receive signals from the integer adder and to directly output the mantissa of the sum.

Example 35 is the circuitry of any one of examples 21-34, optionally further comprising: additional two-element dot-product circuits; and an adder tree configured to receive sum signals from the two-element dot-product circuit and the additional two-element dot-product circuits, wherein the adder tree comprises a first stage adder that includes: a mantissa swapping circuit; and an alignment shifter that directly receives a signal from the mantissa swapping circuit.

Example 36 is the circuitry of example 35, wherein the adder tree further optionally comprises a second stage adder having the same structure as the first stage adder but configured to generate signals with a larger mantissa than the first stage adder.

Example 37 is circuitry, comprising: a plurality of dot-product circuits configured to output sum signals; an adder tree configured to receive the sum signals from the plurality of dot-product circuits; a digital signal processing (DSP) block configured to output an additional sum signal in a given floating-point format; and a floating-point format conversion circuit configured to convert the additional sum signal from the given floating-point format to another floating-point format of the adder tree.

Example 38 is the circuitry of example 37, wherein the floating-point format conversion circuit optionally comprises: an exponent subtraction circuit configured to subtract a predetermined integer from the exponent of the additional sum signal; a two's complement converter circuit configured to receive the mantissa of the additional sum signal; and a bit reduction circuit configured to receive signals from the two's complement converter circuit and to directly output a converted mantissa value to the adder tree.

Example 39 is circuitry, comprising: a plurality of dot-product circuits configured to output sum signals; an adder tree configured to receive the sum signals from the plurality of dot-product circuits; and a normalization circuit configured to receive an output signal from the adder tree and to convert the output signal from a first floating-point format to a second floating-point format that is different than the first floating-point format.

Example 40 is the circuitry of example 39, wherein the normalization circuit optionally comprises: a sign-magnitude converter configured to receive the mantissa of the output signal; a leading zero counter coupled to the sign-magnitude converter; a normalization shifter controlled by the leading zero counter; and a zero padding circuit configured to receive signals from the normalization shifter.

Example 41 is circuitry, comprising: a plurality of dot-product circuits configured to output sum signals; and an adder tree configured to receive the sum signals from the plurality of dot-product circuits, wherein the plurality of dot-product circuits and the adder tree comprise carry chain based shifting circuits.

Example 42 is the circuitry of example 41, wherein the plurality of dot-product circuits are optionally configured to receive inputs having a first floating-point format.

Example 43 is the circuitry of example 42, wherein the first floating-point format is optionally a BFLOAT16 format having one sign bit, eight exponent bits, and seven fraction bits.
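The BFLOAT16 layout of example 43 matches the upper 16 bits of an IEEE 754 single precision word, which the following sketch uses to pack and unpack the fields. The helper names are hypothetical, and the conversion from single precision is shown with simple truncation as an assumption; rounding behavior is not specified here.

```python
import struct

def float_to_bfloat16_bits(x):
    """Truncate an FP32 value to its upper 16 bits (assumed truncation)."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16

def bfloat16_fields(bits16):
    """Split a 16-bit word into (sign, exponent, fraction): 1 + 8 + 7 bits."""
    sign = (bits16 >> 15) & 0x1
    exponent = (bits16 >> 7) & 0xFF
    fraction = bits16 & 0x7F
    return sign, exponent, fraction

def bfloat16_bits_to_float(bits16):
    """Widen a BFLOAT16 word back to FP32 by appending 16 zero bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

b = float_to_bfloat16_bits(3.14159)
print(hex(b), bfloat16_fields(b), bfloat16_bits_to_float(b))
```

For the input 3.14159, the seven fraction bits retain the value 3.140625, which illustrates the precision traded away in exchange for the full eight-bit exponent range.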

Example 44 is the circuitry of any one of examples 42-43, wherein the output sum signals optionally have a second floating-point format that is different than the first floating-point format.

Example 45 is the circuitry of example 44, wherein the second floating-point format optionally has more exponent bits than the first floating-point format.

Example 46 is the circuitry of any one of examples 44-45, wherein the second floating-point format optionally has an adjustable number of fraction bits that determines the amount of truncation for the second floating-point format.

Example 47 is the circuitry of any one of examples 41-46, wherein at least one of the carry chain based shifting circuits optionally comprises a series of arithmetic cells connected in a chain.

Example 48 is the circuitry of example 47, wherein at least one arithmetic cell in the series optionally comprises: a first lookup table configured to provide a first function; and a second lookup table configured to provide a second function that is different than the first function.

Example 49 is the circuitry of example 48, wherein the first and second lookup tables are optionally configured to receive the same input signals.

Example 50 is the circuitry of any one of examples 48-49, wherein the at least one arithmetic cell optionally further comprises a logic gate configured to receive signals from the first lookup table.

Example 51 is the circuitry of example 50, wherein the logic gate optionally comprises a logic XOR gate.

Example 52 is the circuitry of any one of examples 50-51, wherein the at least one arithmetic cell optionally further comprises a multiplexing circuit configured to receive signals from the second lookup table, and wherein the multiplexing circuit is optionally controlled by the signals output from the first lookup table.

Example 53 is the circuitry of any one of examples 48-52, wherein the at least one arithmetic cell optionally further comprises an adder configured to receive signals from the first lookup table.

Example 54 is the circuitry of example 53, wherein the adder is optionally coupled to at least one other arithmetic cell in the chain.

Example 55 is a shifting circuit, comprising: a first arithmetic cell; and a second arithmetic cell coupled to the first arithmetic cell in a chain, wherein the first and second arithmetic cells include a carry chain that generates shifted output bits.

Example 56 is the shifting circuit of example 55, wherein the first arithmetic cell is optionally configured to receive a first input bit and a control bit, and wherein the second arithmetic cell is optionally configured to receive a second input bit and the control bit.

Example 57 is the shifting circuit of any one of examples 55-56, wherein the first and second arithmetic cells optionally have identical structures.

Example 58 is the shifting circuit of any one of examples 55-57, wherein the carry chain optionally comprises a plurality of logic gates and multiplexing circuits connected in series.

Example 59 is the shifting circuit of any one of examples 55-58, wherein the carry chain optionally comprises a plurality of adders connected in series.

Example 60 is a bit shifting circuit, comprising: first four-input lookup tables configured to receive input signals and to apply a first function on the input signals; second four-input lookup tables configured to receive the input signals and to apply a second function that is different than the first function on the input signals; and a carry chain configured to receive signals output from the first and second four-input lookup tables and to generate a shifted version of the input signals.
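One way a chain of such arithmetic cells could produce shifted output bits is sketched below. The choice of lookup table functions is an assumption made for illustration: the first lookup table acts as a carry propagate signal (input bit AND NOT control bit), the second as a carry generate signal (input bit AND control bit), the XOR gate forms the output bit, and the multiplexer selects the outgoing carry. Two-input tables are used for brevity in place of the four-input tables of example 60, and all function names are hypothetical.

```python
def lut_first(a, ctrl):
    """Assumed first LUT function: propagate = a AND (NOT ctrl)."""
    return a & (1 - ctrl)

def lut_second(a, ctrl):
    """Assumed second LUT function: generate = a AND ctrl."""
    return a & ctrl

def carry_chain_shift(bits, ctrl):
    """Shift an LSB-first bit list left by one position when ctrl is 1.

    Each loop iteration models one arithmetic cell; the carry variable
    models the carry chain that links neighboring cells.
    """
    carry = 0
    out = []
    for a in bits:
        p = lut_first(a, ctrl)   # both LUTs receive the same inputs
        g = lut_second(a, ctrl)
        out.append(p ^ carry)    # XOR gate fed by the first LUT
        # Multiplexer controlled by the first LUT: pass the incoming
        # carry when propagating, otherwise take the second LUT output.
        carry = carry if p else g
    return out

value = [1, 0, 1, 1]  # LSB first, i.e. 0b1101 = 13
print(carry_chain_shift(value, 0))  # control bit 0: unchanged
print(carry_chain_shift(value, 1))  # control bit 1: shifted left by one
```

When the control bit is 1, every cell forwards its input bit on the carry chain and outputs the bit carried in from its neighbor, so the word moves up by one position and the top bit is dropped. When the control bit is 0, the carry stays at zero and each cell outputs its own input bit.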

For instance, all optional features of the apparatus described above may also be implemented with respect to the method or process described herein. The foregoing is merely illustrative of the principles of this disclosure and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.

What is claimed is:
 1. A programmable logic device (PLD), comprising: machine learning training circuitry, configured to train a neural network, comprising: a pipeline, comprising: a first stage circuitry configured to load a first matrix and a second matrix from off-chip memory; a second stage configured to perform matrix multiplication of the first matrix and the second matrix; and a third stage configured to load a result of the second stage matrix multiplication to the off-chip memory.
 2. The programmable logic device of claim 1, wherein the first stage circuitry comprises: a first load circuit configured to load the first matrix on-chip from the off-chip memory; and a second load circuit configured to load the second matrix on-chip from the off-chip memory.
 3. The programmable logic device of claim 2, configured to: reduce memory traffic by performing one or more transpositions, activation functions or both within the pipeline to: mutate the first matrix, the second matrix, or both loaded from off-chip memory, wherein the first load circuit, the second load circuit, or both; mutate the result of the second stage matrix multiplication prior to loading the result to the off-chip memory; or both.
 4. The programmable logic device of claim 1, configured to: perform the training of the neural network, comprising a multilayer perceptron, via a two-pass execution by the PLD performed at successive layers of the multilayer perceptron, comprising: a forward pass that performs the matrix multiplication of the first matrix and the second matrix, wherein the first matrix comprises a current weight matrix and the second matrix comprises a prior layer's output passed through an activation function; and a backward pass that computes a gradient of the activation function and determines changes to be applied to the current weight matrix.
 5. The programmable logic device of claim 4, comprising: a stochastic gradient descent circuit configured to implement the backward pass via stochastic gradient descent.
 6. The programmable logic device of claim 1, configured to enhance off-memory matrix access, by loading the second stage matrix multiplication to the off-chip memory in an ordered manner, such that one or more sequences of consecutive addresses may be grouped in bursts for joint retrieval.
 7. The programmable logic device of claim 1, comprising one or more systolic arrays, wherein the second stage is configured to perform the matrix multiplication of the first matrix and the second matrix using the one or more systolic arrays.
 8. The programmable logic device of claim 7, wherein the one or more systolic arrays comprise: one or more processing elements; and control logic for coordinating the one or more processing elements.
 9. The programmable logic device of claim 8, comprising a row feeder, wherein the one or more processing elements comprise a row of processing elements fed with at least a portion of the first matrix via the row feeder.
 10. The programmable logic device of claim 8, comprising a column feeder, wherein the one or more processing elements comprise a column of processing elements fed with at least a portion of the second matrix via the column feeder.
 11. The programmable logic device of claim 8, wherein the one or more processing elements comprise: a hybrid floating-point dot-product circuitry comprising both a hard floating-point multiplier and a soft floating-point multiplier.
 12. The programmable logic device of claim 11, comprising: one or more delay registers between circuitry in the first stage and circuitry in the second stage to counteract latency discrepancies between the hard floating-point multiplier and the soft floating-point multiplier.
 13. The programmable logic device of claim 11, wherein the circuitry in the second stage comprises the hard floating-point multiplier.
 14. The programmable logic device of claim 11, wherein the one or more processing elements comprise: an accumulation storage circuit configured to: store intermediate results of the hybrid floating-point dot-product circuitry; and selectively feed accumulated data back as input to the hybrid floating-point dot-product circuitry.
 15. An integrated circuit, comprising: a plurality of processing elements, arranged in rows of processing elements and columns of processing elements, wherein each of the plurality of processing elements comprises a hybrid floating-point dot-product circuitry comprising both a hard floating-point multiplier and a soft floating-point multiplier; and one or more delay registers between circuitry configured to counteract latency discrepancies between the hard floating-point multiplier and the soft floating-point multiplier.
 16. The integrated circuit of claim 15, comprising: a row feeder configured to feed off-chip matrix data to a corresponding row of the rows of processing elements; and a column feeder configured to feed additional off-chip matrix data to a corresponding column of the columns of processing elements.
 17. The integrated circuit of claim 15, wherein each of the plurality of processing elements comprises an accumulation storage circuitry configured to store intermediate results of the hybrid floating-point dot-product circuitry.
 18. A programmable logic device-implemented method, comprising: training a neural network, by: in a first stage of a pipeline, loading a first matrix and a second matrix from off-chip memory; in a second stage of the pipeline, performing matrix multiplication of the first matrix and the second matrix; and in a third stage of the pipeline, loading a result of the second stage matrix multiplication to the off-chip memory.
 19. The programmable logic device-implemented method of claim 18, comprising: performing the training of the neural network, comprising a multilayer perceptron, via a two-pass execution performed at successive layers of the multilayer perceptron, comprising: a forward pass that performs the matrix multiplication of the first matrix and the second matrix, wherein the first matrix comprises a current weight matrix and the second matrix comprises a prior layer's output passed through an activation function; and a backward pass that computes a gradient of the activation function and determines changes to be applied to the current weight matrix.
 20. The programmable logic device-implemented method of claim 19, comprising: implementing the backward pass via stochastic gradient descent.